Abstract

In this paper we discuss λ-policy iteration, a method for exact and approximate dynamic programming.
It is intermediate between the classical value iteration (VI) and policy iteration (PI) methods, and it is closely 
related to optimistic (also known as modified) PI, whereby each policy evaluation is done approximately, using 
a finite number of VI. We review the theory of the method and associated questions of bias and exploration 
arising in simulation-based cost function approximation. We then discuss various implementations, which 
offer advantages over well-established PI methods that use LSPE(λ), LSTD(λ), or TD(λ) for policy evaluation
with cost function approximation. One of these implementations is based on a new simulation scheme, called 
geometric sampling, which uses multiple short trajectories rather than a single infinitely long trajectory. 


1. INTRODUCTION 

Approximate dynamic programming (DP for short) has attracted substantial research interest, and has a 
wide range of applications, because of its potential to address large and complex problems that may not 
be treatable in other ways. The literature on the subject is very extensive, and includes several textbooks, 
research monographs, and surveys that relate to the computational context of this paper. For a nonexhaustive 
list, we mention the books by Bertsekas and Tsitsiklis [BeT96], Sutton and Barto [SuB98], Gosavi [Gos03],
Cao [Cao07], Chang, Fu, Hu, and Marcus [CFH07], Meyn [Mey07], Powell [Pow07], Borkar [Bor08], Haykin
[Hay08], Busoniu, Babuska, De Schutter, and Ernst [BBD10], and the author's text in preparation [Ber11a];
the edited volumes and special issues by White and Sofge [WhS92], Si, Barto, Powell, and Wunsch [SBP04],
Lewis, Lendaris, and Liu [LLL08], and the 2007-2009 Proceedings of the IEEE Symposium on Approximate
Dynamic Programming and Reinforcement Learning; and the recent surveys by Borkar [Bor09], Lewis and
Vrabie [LeV09], Werbos [Wer09], Szepesvari [Sze10], and Bertsekas [Ber11b].

† To appear in Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, by F.
Lewis and D. Liu (eds.), IEEE Press Computational Intelligence Series.

The purpose of this paper is to critically review and extend a class of methods for exact and approximate 
DP, which are based on the λ-policy iteration (λ-PI) method, proposed by Bertsekas and Ioffe [BeI96]. This
method is intermediate between the classical value iteration (VI) and policy iteration (PI) methods, and
it is closely related to optimistic (also known as modified) PI, whereby each policy evaluation is done
approximately, using a finite number of VI. It was originally used as the starting point for the development
of approximate simulation-based DP methods of the temporal difference (TD) type, such as LSPE(λ) (see
[BeI96], and also [BeT96], Sections 2.3.1 and 8.3). The emphasis in this paper is on implementations of λ-PI,
which provide alternatives to approximate PI methods that use other more established methods for policy 
evaluation. 

We will focus on the $\alpha$-discounted $n$-state Markovian Decision Problem (MDP), although the main ideas
are more broadly applicable. The problem involves states $1,\ldots,n$, controls $u \in U(i)$ at state $i$, transition
probabilities $p_{ij}(u)$, and cost $g(i,u,j)$ for transition from $i$ to $j$ under control $u$. A (stationary) policy $\mu$ is a
function from states $i$ to admissible controls $u \in U(i)$, and $J_\mu(i)$ is the cost starting from state $i$ and using
policy $\mu$. It is well-known (see e.g., Puterman [Put94] or Bertsekas [Ber07]) that the vector $J_\mu \in \Re^n$, which
has components $J_\mu(i)$, is the unique fixed point of the mapping $T_\mu : \Re^n \mapsto \Re^n$, which maps $J \in \Re^n$ to the
vector $T_\mu J \in \Re^n$ that has components

$$(T_\mu J)(i) = \sum_{j=1}^n p_{ij}\big(\mu(i)\big)\big(g(i,\mu(i),j) + \alpha J(j)\big), \qquad i = 1,\ldots,n. \tag{1.1}$$

Similarly, the optimal costs starting from $i = 1,\ldots,n$, are denoted $J^*(i)$, and the optimal cost vector
$J^* \in \Re^n$, which has components $J^*(i)$, is the unique fixed point of the mapping $T : \Re^n \mapsto \Re^n$ defined by

$$(TJ)(i) = \min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u)\big(g(i,u,j) + \alpha J(j)\big), \qquad i = 1,\ldots,n. \tag{1.2}$$

An important property is that $T_\mu$ and $T$ are sup-norm contractions. In particular, the iterations $J_{k+1} = T_\mu J_k$
and $J_{k+1} = T J_k$ converge to $J_\mu$ and $J^*$, respectively, from any starting point $J_0$; this is the VI method.
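As an illustration of the mappings (1.1)-(1.2) and of the VI method, the following is a minimal NumPy sketch. The array layout (`P[u, i, j]` for $p_{ij}(u)$ and `g[u, i, j]` for $g(i,u,j)$), the function names, and the simplification that every control is admissible at every state are assumptions made here for concreteness; they are not part of the paper.

```python
import numpy as np

def bellman_policy(J, P, g, mu, alpha):
    """(T_mu J)(i), cf. Eq. (1.1); mu[i] is the control applied at state i."""
    idx = np.arange(len(J))
    # P[mu, idx] has row i equal to the transition probabilities out of i under mu[i]
    return np.sum(P[mu, idx] * (g[mu, idx] + alpha * J), axis=1)

def bellman_optimal(J, P, g, alpha):
    """(T J)(i), cf. Eq. (1.2): minimize the expected cost over controls."""
    Q = np.sum(P * (g + alpha * J), axis=2)   # Q[u, i]
    return Q.min(axis=0)

def value_iteration(P, g, alpha, tol=1e-10, max_iter=10_000):
    """Iterate J_{k+1} = T J_k from J_0 = 0 until successive iterates are within tol."""
    J = np.zeros(P.shape[1])
    for _ in range(max_iter):
        J_next = bellman_optimal(J, P, g, alpha)
        if np.max(np.abs(J_next - J)) < tol:
            return J_next
        J = J_next
    return J
```

Because $T$ is a sup-norm contraction of modulus $\alpha$, the stopping test on successive iterates also controls the distance to $J^*$, up to the factor $1/(1-\alpha)$.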

A major alternative to VI is PI. It produces a sequence of policies and associated cost functions through 
iterations that have two phases: policy evaluation (where the cost function of a policy is evaluated), and 
policy improvement (where a new policy is generated). In the exact form of the algorithm, the current policy
$\mu$ is improved by finding $\bar\mu$ that satisfies $T_{\bar\mu} J_\mu = T J_\mu$ [i.e., by minimizing in the right-hand side of Eq.
(1.2) with $J_\mu$ in place of $J$]. The improved policy $\bar\mu$ is evaluated by solving the linear system of equations
$J_{\bar\mu} = T_{\bar\mu} J_{\bar\mu}$, and $(J_{\bar\mu}, \bar\mu)$ becomes the new cost vector-policy pair, which is used to start a new iteration.
Thus, the exact form of PI can be succinctly defined as

$$T_{\mu_{k+1}} J_k = T J_k, \qquad J_{k+1} = T_{\mu_{k+1}} J_{k+1}, \tag{1.3}$$

with the equation on the left describing the policy improvement and the equation on the right describing
the evaluation of $\mu_{k+1}$.

† In our notation, $\Re^n$ is the n-dimensional Euclidean space, all vectors in $\Re^n$ are viewed as column vectors, and
a prime denotes transposition. The identity matrix is denoted by $I$.

In a variant of the method, a policy $\mu_{k+1}$ is evaluated by a finite number of applications of $T_{\mu_{k+1}}$ to
an approximate evaluation of the preceding policy. This is known as "optimistic" or "modified" PI, and its
motivation is that in problems with a large number of states, the linear system $J_{k+1} = T_{\mu_{k+1}} J_{k+1}$ cannot
be practically solved directly by matrix inversion, so it is best solved iteratively by VI. The method can be
succinctly defined as

$$T_{\mu_{k+1}} J_k = T J_k, \qquad J_{k+1} = T_{\mu_{k+1}}^{m_k} J_k. \tag{1.4}$$

If the number $m_k$ of applications of $T_{\mu_{k+1}}$ is very large, the exact form of PI is essentially obtained, but
practice has shown that it is most efficient to use a moderate value of $m_k$. In this case, the algorithm looks
like a hybrid of VI and PI, involving a sequence of alternate applications of $T$ and $T_{\mu_k}$, with $\mu_k$ changing
over time. Optimistic PI is generally believed to be more computationally efficient than either VI or PI. This
is particularly so for problems where $n$ is very large and implementation of exact PI is difficult due to the
associated $n \times n$ matrix inversion, and also for problems with a large number of controls, where the overhead
due to minimization over all controls $u \in U(i)$ in the mapping $T$ [cf. Eq. (1.2)] is substantial.
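The optimistic PI iteration (1.4) can be sketched as follows. This is only an illustrative sketch: the array layout (`P[u, i, j]`, `g[u, i, j]`), the constant value `m` used for $m_k$, the starting point $J_0 = 0$, and the fixed iteration count are assumptions made here, not choices prescribed by the paper.

```python
import numpy as np

def greedy_policy(J, P, g, alpha):
    """Policy improvement: mu(i) attains the minimum in (T J)(i)."""
    Q = np.sum(P * (g + alpha * J), axis=2)   # Q[u, i]
    return Q.argmin(axis=0)

def apply_T_mu(J, P, g, mu, alpha):
    """One application of the mapping T_mu to J."""
    idx = np.arange(len(J))
    return np.sum(P[mu, idx] * (g[mu, idx] + alpha * J), axis=1)

def optimistic_pi(P, g, alpha, m=5, num_iters=200):
    """Optimistic PI: T_{mu_{k+1}} J_k = T J_k, then J_{k+1} = T_{mu_{k+1}}^m J_k."""
    n = P.shape[1]
    J = np.zeros(n)
    mu = np.zeros(n, dtype=int)
    for _ in range(num_iters):
        mu = greedy_policy(J, P, g, alpha)
        for _ in range(m):                     # m value iterations for the fixed policy mu
            J = apply_T_mu(J, P, g, mu, alpha)
    return J, mu
```

With `m=1` the loop reduces to VI, while a very large `m` approaches exact PI, since the inner loop then nearly solves $J = T_\mu J$.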

We note that the convergence properties of the optimistic PI method (1.4) are quite complicated and 
have been the subject of continuing research. The convergence $J_k \to J^*$ has been established by Rothblum
[Rot79] (see also the more recent work by Canbolat and Rothblum [CaR11], which extends some of the results
of [Rot79]). On the other hand, when optimistic PI is implemented asynchronously (as it normally would
be when simulation is used), it may oscillate as shown by the convergence counterexamples of Williams and
Baird [WiB93]. Recent work of Bertsekas and Yu [BeY10a], [BeY10b], [YuB11] has developed convergent
variants of synchronous and asynchronous optimistic PI and Q-learning, based on a new way to perform 
policy evaluation: by solving approximately an optimal stopping problem rather than a system of linear 
equations. 

The λ-PI method is a form of optimistic PI, given by

$$T_{\mu_{k+1}} J_k = T J_k, \qquad J_{k+1} = T_{\mu_{k+1}}^{(\lambda)} J_k, \tag{1.5}$$

where for any $\mu$ and $\lambda \in [0,1)$, $T_\mu^{(\lambda)}$ is the linear mapping given by

$$T_\mu^{(\lambda)} = (1-\lambda) \sum_{\ell=0}^\infty \lambda^\ell T_\mu^{\ell+1}. \tag{1.6}$$

Note that the mapping $T_\mu^{(\lambda)}$ is central in much recent research on approximate DP, simulation-based PI, and
TD methods, as will be discussed in the sequel.

To compare the optimistic PI method (1.4) and the λ-PI method (1.5), note that both mappings
$T_{\mu_{k+1}}^{m_k}$ and $T_{\mu_{k+1}}^{(\lambda)}$ appearing in Eqs. (1.4) and (1.5) involve multiple applications of the VI mapping $T_{\mu_{k+1}}$:
a fixed number $m_k$ in the former case (with $m_k = 1$ corresponding to VI and $m_k \to \infty$ corresponding
to PI), and a geometrically weighted number in the latter case (with $\lambda = 0$ corresponding to VI and
$\lambda \to 1$ corresponding to PI). Thus optimistic PI and λ-PI are similar: they just control the accuracy of
the approximation $J_{k+1} \approx J_{\mu_{k+1}}$ by applying VI in different ways. In a classical DP/non-simulation-based
setting, λ-PI is far more complicated relative to optimistic PI, since exact computations using the mapping
$T_{\mu_{k+1}}^{(\lambda)}$ are unwieldy. However, this advantage of optimistic PI is dissipated in a simulation context, where
computations involving $T_\mu^{(\lambda)}$ can be performed conveniently, as extensive analytical and experimental work
with TD methods has demonstrated.

Recent research on DP has focused on the use of simulation, in order to deal with model-free situations 
where the transition probabilities and/or the cost per stage are not known explicitly, and also to deal with 
the associated high-dimensional linear algebra operations. For problems with a very large number of states,
the evaluation of various fixed points of mappings, such as $T_\mu$ or $T_\mu^{(\lambda)}$, is typically done by approximation
with a vector $\Phi r$ from the subspace $S = \{\Phi r \mid r \in \Re^s\}$ that is spanned by the columns of an $n \times s$ matrix $\Phi$.
In this paper we will focus on the projected equation approach, whereby given a generic mapping $L : \Re^n \mapsto \Re^n$
(such as for example $T_\mu$) we approximate its fixed point by solving the equation

$$\Phi r = \Pi L(\Phi r),$$

where $\Pi$ denotes projection onto the subspace $S$. The projection is with respect to a Euclidean norm $\|\cdot\|_\xi$,
weighted by a suitable vector $\xi$ of positive weights. An alternative possibility is to solve instead the equation

$$\Phi r = \Pi L^{(\nu)}(\Phi r), \tag{1.7}$$

where, similar to Eq. (1.6),

$$L^{(\nu)} = (1-\nu) \sum_{\ell=0}^\infty \nu^\ell L^{\ell+1},$$

and $\nu \in [0,1)$ is a parameter [not necessarily the same as the $\lambda$ parameter in Eqs. (1.5)-(1.6)]. In our context
we will encounter several different types of mappings $L$, and in all cases $L$ is a contraction with respect to
the projection norm $\|\cdot\|_\xi$, with fixed point $J$, while $\Pi L^{(\nu)}$ are contractions with respect to $\|\cdot\|_\xi$ for all
$\nu \in [0,1)$. It is well-known that the fixed point of $\Pi L^{(\nu)}$, denoted $\Phi r(\nu)$, converges to $\Pi J$ as $\nu \to 1$. The
norm of the difference $\Phi r(\nu) - \Pi J$ is known as the bias. Its size depends on $\nu$ and is generally smaller
as $\nu$ gets closer to 1 (see [BeT96], [TsV97], [YuB10] for error bound analyses).
A common example of fixed point approximation in PI is when $L = T_\mu$ for a policy $\mu$, in which case
the fixed point of $\Pi L$ or $\Pi L^{(\nu)}$ is an approximation to the fixed point of $T_\mu$, i.e., the cost vector $J_\mu$. If the
Markov chain corresponding to $\mu$ is irreducible and $\xi$ is the corresponding steady-state distribution vector,
the mapping $\Pi T_\mu^{(\lambda)}$ is a contraction with respect to $\|\cdot\|_\xi$ for all $\lambda \in [0,1)$, and its unique fixed point, denoted
$\Phi r_\mu(\lambda)$, converges to $\Pi J_\mu$ as $\lambda \to 1$. Generally, the projected equation $\Phi r = \Pi T_\mu^{(\lambda)}(\Phi r)$ is solved by a
simulation process that generates a sequence of states according to a sampling scheme to be discussed later,
and then by matrix inversion [this is the Least Squares Temporal Differences [LSTD(λ)] method, proposed
by Bradtke and Barto [BrB96]], or by iteration, using the TD(λ) method, proposed by Sutton [Sut87]
and analyzed by Tsitsiklis and Van Roy [TsV97] among others, or the Least Squares Policy Evaluation
[LSPE(λ)] method, proposed by Bertsekas and Ioffe [BeI96].† These methods are extensively discussed in
the literature, and exhibit complex and sometimes pathological behavior, particularly when embedded within
PI (see [Ber95], [SzL06], [ThS09] for some notable failures, and [Ber10] for a recent assessment). Moreover
matrix inversion and iterative methods, like TD(λ), LSTD(λ), and LSPE(λ), can be used for solving not
only the projected equation $\Phi r = \Pi T_\mu^{(\lambda)}(\Phi r)$, but also the more general equation $\Phi r = \Pi L^{(\nu)}(\Phi r)$ of Eq.
(1.7), as long as $L$ is a linear mapping that is convenient for the use of simulation [and in the case of TD(λ)
and LSPE(λ), $\Pi L^{(\nu)}$ is a contraction; see [BeY09] or [Ber11c]].

In this paper we will review some of the basic issues in approximate PI using the projected equation 
approach, thereby setting the stage for assessing the relative strengths and weaknesses of the λ-PI
methodology. We will then focus on three alternative implementations of λ-PI, which involve simulation and cost
function approximation. The first is basically the LSPE(λ) method as implemented in [BeI96]. The second
is an interesting recent proposal by Thiery and Scherrer [ThS10a], who gave extensive and quite successful
computational results, as well as error bounds [ThS10b]. The third implementation is new and may have
some advantages over the first two. We will argue that it deals better with the combined issues of bias
and exploration. This implementation embodies a new idea for λ-methods: a simulation scheme, called
geometric sampling, that uses multiple short trajectories with random geometrically distributed length, and
exploration-enhanced restart, rather than a single infinitely long trajectory.

† The paper [BeI96] as well as the book [BeT96] used the name "λ-policy iteration" for both the lookup table
and the compact representation versions of the method described here, and tested a compact representation version
on the game of tetris, a challenging SSP problem. The name "LSPE" was first used in the subsequent paper by
Nedic and Bertsekas [NeB03] to describe a specific iterative implementation of the λ-PI method with cost function
approximation for discounted MDP (essentially the discounted version of the implementation used in [BeI96] and
[BeT96] for the aforementioned tetris case study). Reference [NeB03] proved convergence of the LSPE(λ) method,
as described in Section 3.1, for the case of a diminishing stepsize. Convergence for a stepsize equal to 1 was proved
shortly afterwards by Bertsekas, Borkar, and Nedic [BBN04]. The use of two different names for essentially the same
method has been a source of some confusion. While in practical implementations these two names refer to algorithms
that are closely related, we reserve the name "λ-policy iteration" for the more abstract form (1.5)-(1.6), and we will
view LSPE(λ) as an implementation of λ-PI (see Section 4.1).

The three implementations are described in Section 4, following a discussion of the generic properties 
of exact λ-PI in Section 2, and the LSTD(λ) and LSPE(λ) methods in Section 3. In our description,
these implementations are model-based and use cost function approximation, but there are versions that are 
model-free and use Q-factor approximation; these can be straightforwardly constructed by the reader. 


2. LAMBDA-POLICY ITERATION WITHOUT COST FUNCTION APPROXIMATION 

We first recall a central result from [BeI96]. It provides a helpful characterization of the λ-PI method (1.5),
which will later become the basis for cost function approximations. 


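Proposition 2.1: Given $\lambda \in [0,1)$, $J_k$, and $\mu_{k+1}$, consider the mapping $W_k$ defined by

$$W_k J = (1-\lambda) T_{\mu_{k+1}} J_k + \lambda T_{\mu_{k+1}} J. \tag{2.1}$$

(a) $W_k$ is a sup-norm contraction of modulus $\lambda\alpha$.

(b) The vector $J_{k+1} = T_{\mu_{k+1}}^{(\lambda)} J_k$ generated next by the λ-PI method (1.5) is the unique fixed point
of $W_k$.


Proof: (a) For any two vectors $J$ and $\bar J$, using the definition (2.1) of $W_k$, we have

$$\|W_k J - W_k \bar J\| = \big\|\lambda (T_{\mu_{k+1}} J - T_{\mu_{k+1}} \bar J)\big\| = \lambda \|T_{\mu_{k+1}} J - T_{\mu_{k+1}} \bar J\| \le \lambda\alpha \|J - \bar J\|,$$

where $\|\cdot\|$ denotes the sup-norm, so $W_k$ is a sup-norm contraction with modulus $\lambda\alpha$.

(b) We have

$$J_{k+1} = T_{\mu_{k+1}}^{(\lambda)} J_k = (1-\lambda) \sum_{\ell=0}^\infty \lambda^\ell T_{\mu_{k+1}}^{\ell+1} J_k,$$

so the fixed point property to be shown, $J_{k+1} = W_k J_{k+1}$, is written as

$$(1-\lambda) \sum_{\ell=0}^\infty \lambda^\ell T_{\mu_{k+1}}^{\ell+1} J_k = (1-\lambda) T_{\mu_{k+1}} J_k + \lambda T_{\mu_{k+1}} \Big( (1-\lambda) \sum_{\ell=0}^\infty \lambda^\ell T_{\mu_{k+1}}^{\ell+1} J_k \Big),$$

and evidently holds. Q.E.D.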
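The final step of the proof of part (b) can be spelled out as follows. This is a sketch that relies on the affine form $T_\mu J = g_\mu + \alpha P_\mu J$ (with $P_\mu$ and $g_\mu$ the transition probability matrix and expected single-stage cost vector of $\mu$) and on the fact that the weights $(1-\lambda)\lambda^\ell$ sum to 1, so that $T_{\mu_{k+1}}$ may be moved inside the weighted sum:

```latex
\begin{align*}
(1-\lambda) T_{\mu_{k+1}} J_k
  + \lambda T_{\mu_{k+1}} \Big( (1-\lambda) \sum_{\ell=0}^\infty \lambda^\ell T_{\mu_{k+1}}^{\ell+1} J_k \Big)
&= (1-\lambda) T_{\mu_{k+1}} J_k
  + (1-\lambda) \sum_{\ell=0}^\infty \lambda^{\ell+1} T_{\mu_{k+1}}^{\ell+2} J_k \\
&= (1-\lambda) \sum_{\ell=0}^\infty \lambda^{\ell} T_{\mu_{k+1}}^{\ell+1} J_k \\
&= J_{k+1}.
\end{align*}
```

Uniqueness of the fixed point then follows from the contraction property of part (a).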
From part (b) of the preceding proposition, we see that $J_{k+1} = W_k J_{k+1}$, or equivalently

$$J_{k+1}(i) = \sum_{j=1}^n p_{ij}\big(\mu_{k+1}(i)\big) \big( g(i,\mu_{k+1}(i),j) + (1-\lambda)\alpha J_k(j) + \lambda\alpha J_{k+1}(j) \big), \qquad i = 1,\ldots,n. \tag{2.2}$$

The solution of this fixed point equation can be obtained by viewing it as Bellman's equation for two
equivalent MDP.

(a) As Bellman's equation for an infinite-horizon $\lambda\alpha$-discounted MDP where $\mu_{k+1}$ is the only policy, and
the cost per stage is

$$g(i,\mu_{k+1}(i),j) + (1-\lambda)\alpha J_k(j).$$

(b) As Bellman's equation for an infinite-horizon stopping problem where $\mu_{k+1}$ is the only policy. In
particular, $J_{k+1}$ is the cost vector of policy $\mu_{k+1}$ in a stopping problem that is derived from the given
$\alpha$-discounted problem by introducing transitions from each state $j$ to an artificial termination state
as follows: at state $i$ we first make a transition to $j$ with probability $p_{ij}(\mu_{k+1}(i))$ and transition cost
$g(i,\mu_{k+1}(i),j)$; then we either stay at $j$ and wait for the next transition (this occurs with probability
$\lambda$), or else we move from $j$ to the termination state with an additional termination cost $\alpha J_k(j)$ (this
occurs with probability $1-\lambda$). All transition costs as well as the termination cost are discounted by
an additional factor $\alpha$ with each transition.
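In vector form, Eq. (2.2) reads $J_{k+1} = g_\mu + (1-\lambda)\alpha P_\mu J_k + \lambda\alpha P_\mu J_{k+1}$ with $\mu = \mu_{k+1}$, which is a linear system in $J_{k+1}$. The sketch below solves it directly and, for comparison, evaluates the truncated series (1.6). The names `P_mu` and `g_mu` (transition matrix and expected one-stage cost vector of the policy), the function names, and the truncation length are assumptions made for illustration.

```python
import numpy as np

def lambda_pi_step(J, P_mu, g_mu, alpha, lam):
    """Solve J_next = g_mu + (1-lam)*alpha*P_mu J + lam*alpha*P_mu J_next, cf. Eq. (2.2)."""
    n = len(J)
    rhs = g_mu + (1.0 - lam) * alpha * (P_mu @ J)
    return np.linalg.solve(np.eye(n) - lam * alpha * P_mu, rhs)

def lambda_pi_step_series(J, P_mu, g_mu, alpha, lam, terms=500):
    """Truncated series (1-lam) * sum_l lam^l T_mu^{l+1} J, cf. Eq. (1.6)."""
    out = np.zeros(len(J))
    TJ = np.array(J, dtype=float)
    for l in range(terms):
        TJ = g_mu + alpha * (P_mu @ TJ)        # TJ now equals T_mu^{l+1} J
        out += (1.0 - lam) * lam**l * TJ
    return out
```

The linear solve exploits the fact that the system matrix $I - \lambda\alpha P_\mu$ is better conditioned than the matrix $I - \alpha P_\mu$ of exact policy evaluation, in line with the $\lambda\alpha$-discounted interpretation (a).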

The convergence and rate of convergence of the λ-PI method (1.5) were given in [BeI96] and also in
[BeT96], Prop. 2.8. We will simply quote the results for completeness.


Proposition 2.2: Assume that $\lambda \in [0,1)$, and let $\{J_k, \mu_k\}$ be the sequence generated by the λ-PI
method (1.5). Then $J_k$ converges to $J^*$. Furthermore, for all $k$ greater than some index $\bar k$, $\mu_k$ is
optimal.


Proposition 2.3: Let the assumptions of Prop. 2.2 hold and let $\bar k$ be the index such that for all
$k > \bar k$, $\mu_k$ is optimal. The sequence $\{J_k\}$ generated by the λ-PI method (1.5) satisfies for all $k > \bar k$

$$\|J_{k+1} - J^*\| \le \frac{\alpha(1-\lambda)}{1-\lambda\alpha} \|J_k - J^*\|, \tag{2.3}$$

where $\|\cdot\|$ denotes the sup-norm.
Note that the convergence rate estimate (2.3) holds only for $k > \bar k$, essentially after an optimal policy
has been identified, as per Prop. 2.2. Nonetheless, this rate estimate is qualitatively correct, and supports
the empirical observation that the iterates $\{J_k, \mu_k\}$ generated by λ-PI converge faster as $\lambda$ increases. Indeed
in the limit, as $\lambda \to 1$, λ-PI becomes exact PI, and converges to the optimum in a finite number of iterations.
On the other hand, the computation of $J_{k+1} = T_{\mu_{k+1}}^{(\lambda)} J_k$ [cf. Eq. (1.5)] becomes more time-consuming as $\lambda$
increases, particularly when simulation is used, because the simulation-based calculation of $T_{\mu_{k+1}}^{(\lambda)} J_k$ involves
more simulation noise as $\lambda$ gets larger.

We finally note that Props. 2.2 and 2.3 apply to synchronous implementations of λ-PI. When implemented
asynchronously, λ-PI has similar convergence difficulties to optimistic PI. To see this, note that
asynchronous implementations of these two methods essentially coincide when $m_k = 1$ in Eq. (1.4) and
$\lambda = 0$ in Eq. (1.5), and the counterexamples of Williams and Baird [WiB93] apply. Thus the development
of convergent versions of asynchronous λ-PI is an open research question.

3. APPROXIMATE POLICY EVALUATION USING PROJECTED EQUATIONS

In PI methods with cost function approximation, we evaluate $\mu$ by approximating $J_\mu$ with a vector $\Phi r_\mu$ from
the subspace $S = \{\Phi r \mid r \in \Re^s\}$, spanned by the columns of an $n \times s$ matrix $\Phi$, which may be viewed as
basis functions. We generate an "improved" policy $\bar\mu$ using the formula $T_{\bar\mu}(\Phi r_\mu) = T(\Phi r_\mu)$, i.e.,

$$\bar\mu(i) \in \arg\min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u) \big( g(i,u,j) + \alpha \phi(j)' r_\mu \big), \qquad i = 1,\ldots,n,$$

where $\phi(j)'$ is the row of $\Phi$ that corresponds to state $j$ [the method terminates with $\mu$ if $T_\mu(\Phi r_\mu) = T(\Phi r_\mu)$].
We then repeat with $\mu$ replaced by $\bar\mu$. For the purposes of this paper, we assume that $\Phi$ has rank $s$, and
that the Markov chain corresponding to $\mu$ is irreducible.
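The policy improvement step with cost function approximation can be sketched as below. The array layout (`P[u, i, j]`, `g[u, i, j]`), the function names, and the assumption that all controls are admissible at all states are illustrative choices made here.

```python
import numpy as np

def improved_policy(r, Phi, P, g, alpha):
    """Return the greedy policy for the approximate costs Phi @ r, together with T(Phi r)."""
    J_approx = Phi @ r                                  # J_approx[j] = phi(j)' r
    Q = np.sum(P * (g + alpha * J_approx), axis=2)      # Q[u, i]
    return Q.argmin(axis=0), Q.min(axis=0)

def is_terminal(mu, r, Phi, P, g, alpha, tol=1e-10):
    """Termination test T_mu(Phi r) = T(Phi r) for the current policy mu."""
    idx = np.arange(Phi.shape[0])
    T_mu = np.sum(P[mu, idx] * (g[mu, idx] + alpha * (Phi @ r)), axis=1)
    _, T_min = improved_policy(r, Phi, P, g, alpha)
    return bool(np.max(np.abs(T_mu - T_min)) <= tol)
```

Note that only the weight vector `r` is needed to generate the improved policy; the full cost vector `Phi @ r` is formed here only because the example is small.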

As noted earlier, in the projected equation approach to approximate PI, we approximate $J_\mu$ with a
vector of the form $\Phi r_\mu(\lambda)$ that solves the fixed point problem

$$\Phi r = \Pi T_\mu^{(\lambda)}(\Phi r). \tag{3.1}$$

Here $\Pi$ denotes projection onto the subspace $S$ with respect to a weighted Euclidean norm $\|\cdot\|_\xi$, where
$\xi = (\xi_1,\ldots,\xi_n)$ is a probability distribution with positive components (i.e., $\|J\|_\xi^2 = \sum_{i=1}^n \xi_i J(i)^2$, where
$\xi_i > 0$ for all $i$). In nonoptimistic PI methods, the projected equation (3.1) is solved exactly, while in
optimistic PI methods it is solved approximately. We note that this approach has a long history in the
context of Galerkin methods for the approximate solution of high-dimensional or infinite-dimensional linear
equations (partial differential, integral, inverse problems, etc.; see e.g., [Kra72], [Fle84]). In fact some of the
policy evaluation theory referred to in this paper applies to general projected equations arising in contexts
beyond DP (see [BeY09], [Ber09], [Yu10a,b], [Ber11c]). However, Monte Carlo simulation is not part of the
Galerkin methodology, as currently practiced in the numerical analysis field. For this reason much of the
extensive available knowledge about Galerkin methods does not apply to the approximate DP context, which
is primarily simulation-oriented.

We now discuss some of the issues relating to projected equations. While we focus on Eq. (3.1), much 
of our discussion also applies to the more general projected equations. 


Exploration-Contraction Tradeoff


An important choice in the projected equation approach is the distribution $\xi$ that defines the projection
norm $\|\cdot\|_\xi$. This distribution is sometimes chosen to be the steady-state probability vector $\xi_\mu$ of the Markov
chain corresponding to $\mu$, in which case the mapping $\Pi T_\mu^{(\lambda)}$ can be shown to be a contraction with respect
to $\|\cdot\|_{\xi_\mu}$ with modulus

$$\alpha_\lambda = \frac{\alpha(1-\lambda)}{1-\lambda\alpha} \tag{3.2}$$

(see [BeT96], Lemma 6.6, or [Ber07], Prop. 6.3.3).

On the other hand the choice of $\xi$ is related to exploration, i.e., the need to collect an adequately rich
set of samples from a broad and representative set of states. This is a critical issue in simulation-based PI,
and results in a well-known tradeoff: to evaluate a policy $\mu$, we may need to generate cost samples using
$\mu$, but this may affect the simulation results by underrepresenting states that are unlikely to occur under
$\mu$ (more weight is placed on states that are visited more frequently under $\mu$). As a result, the cost-to-go
estimates of the underrepresented states may be highly inaccurate, causing potentially serious errors in the
calculation of the improved control policy.

A well-known approach for exploration is to choose $\xi$ to be a mixture of the form

$$\xi = (1-\beta)\xi_\mu + \beta\bar\xi, \tag{3.3}$$

where $\beta \in (0,1)$ and $\bar\xi$ is another distribution (often referred to as the off-policy distribution), which is added
to enhance exploration (see the discussion of Section 1). Unfortunately, with such a choice the contraction
property of $\Pi T_\mu^{(\lambda)}$ comes into doubt: it depends on the size of the parameters $\lambda$ and $\beta$ [it can be shown
that $\Pi T_\mu^{(\lambda)}$ is a contraction for any $\beta \in [0,1)$ provided $\lambda$ is close enough to 1, and it is a contraction for any
$\lambda \in [0,1)$ provided $\beta$ is close enough to 0]. This is important because for convergence of iterative methods
such as TD(λ) and some forms of LSPE(λ), it is critical that $\Pi T_\mu^{(\lambda)}$ be a contraction. Thus there is a tradeoff
between exploration enhancement using the mixture distribution (3.3) and ability to use a broader range of
methods for solution of the projected equation.


Bias 
While the Bellman equation $J = T_\mu^{(\lambda)} J$ has the same fixed point for all $\lambda \in [0,1)$, the fixed point $\Phi r_\mu(\lambda)$
of the projected version (3.1) depends on $\lambda$. The difference of $\Phi r_\mu(\lambda)$ and the closest point of $S$ to $J_\mu$,
$\Phi r_\mu(\lambda) - \Pi J_\mu$, is generally nonzero. Its norm, the bias, tends to decrease to 0 as $\lambda \uparrow 1$ and tends to increase
as $\lambda \downarrow 0$. It is known that the bias can be very large and may seriously degrade the practical value of the
approximate policy evaluation for small values of $\lambda$; see [Ber95] for some examples.

The following is a well-known error bound for the case $\xi = \xi_\mu$:

$$\|J_\mu - \Phi r_\mu(\lambda)\|_{\xi_\mu} \le \frac{1}{\sqrt{1-\alpha_\lambda^2}} \|J_\mu - \Pi J_\mu\|_{\xi_\mu}, \tag{3.4}$$

where $\alpha_\lambda$ is given by Eq. (3.2), and $\|\cdot\|_{\xi_\mu}$ is the weighted Euclidean norm corresponding to $\xi_\mu$, the
steady-state probability vector of the Markov chain corresponding to $\mu$. Thus the error bound becomes worse
as $\lambda$ decreases (and $\alpha_\lambda$ increases), suggesting a larger size of bias. While the bound is rather conservative,
the paper by Yu and Bertsekas [YuB10] (see also Scherrer [Sch10]) derives sharper error bounds, which also
apply to cases where $\xi \ne \xi_\mu$ and $\Pi T_\mu^{(\lambda)}$ is not a contraction. These error bounds and the bound (3.4) are
consistent in suggesting that the bias increases as $\lambda$ decreases, and they are also largely consistent with the
results of computational experimentation.
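For a small chain, the bias and the bound (3.4) can be computed exactly, without simulation. The sketch below is one way to do this under stated assumptions: it uses the closed forms $P^{(\lambda)} = (1-\lambda) P_\mu (I - \lambda\alpha P_\mu)^{-1}$ and $g^{(\lambda)} = (I - \lambda\alpha P_\mu)^{-1} g_\mu$, obtained by summing the geometric series in the definition of $T_\mu^{(\lambda)}$, and it takes $\xi$ to be the steady-state distribution. The function names and the inputs `P`, `g` (transition matrix and expected cost vector of the policy) are hypothetical.

```python
import numpy as np

def stationary_distribution(P):
    """Solve xi' P = xi', sum(xi) = 1 in the least-squares sense (exact for irreducible P)."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    xi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return xi

def weighted_norm(x, xi):
    """Weighted Euclidean norm ||x||_xi."""
    return np.sqrt(np.sum(xi * x**2))

def projected_fixed_point(P, g, Phi, xi, alpha, lam):
    """Phi r_mu(lambda): exact solution of the projected equation (3.1)."""
    n = P.shape[0]
    M = np.linalg.inv(np.eye(n) - lam * alpha * P)      # sum over l of (lam*alpha*P)^l
    P_lam = (1.0 - lam) * (P @ M)
    g_lam = M @ g
    Xi = np.diag(xi)
    C = Phi.T @ Xi @ (np.eye(n) - alpha * P_lam) @ Phi
    d = Phi.T @ Xi @ g_lam
    return Phi @ np.linalg.solve(C, d)

def bias_and_bound(P, g, Phi, alpha, lam):
    """Return (bias, error, right-hand side of (3.4)) for the steady-state weights."""
    n = P.shape[0]
    xi = stationary_distribution(P)
    Xi = np.diag(xi)
    J_mu = np.linalg.solve(np.eye(n) - alpha * P, g)
    Pi_J = Phi @ np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi @ J_mu)
    Phi_r = projected_fixed_point(P, g, Phi, xi, alpha, lam)
    alpha_lam = alpha * (1.0 - lam) / (1.0 - lam * alpha)
    bias = weighted_norm(Phi_r - Pi_J, xi)
    error = weighted_norm(J_mu - Phi_r, xi)
    bound = weighted_norm(J_mu - Pi_J, xi) / np.sqrt(1.0 - alpha_lam**2)
    return bias, error, bound
```

Evaluating `bias_and_bound` over a grid of `lam` values gives a quick picture of how the bias behaves for a particular problem and choice of basis functions.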

Bias-Variance Tradeoff 

In simulation-based methods for solving the projected equation (3.1), one must deal with the effects of 
simulation error. Generally as $\lambda$ increases, the methods become more vulnerable to simulation noise, and
hence require more sampling for good performance. Indeed, the noise in a simulation sample of an $\ell$-stages
cost vector $T_\mu^\ell J$ tends to be larger as $\ell$ increases, and from the formula

$$T_\mu^{(\lambda)} = (1-\lambda) \sum_{\ell=0}^\infty \lambda^\ell T_\mu^{\ell+1}$$

it can be seen that simulation samples of $T_\mu^{(\lambda)}(\Phi r_k)$ tend to contain more noise as $\lambda$ increases. This is
consistent with practical experience, and gives rise to the so-called bias-variance tradeoff: a large value
of $\lambda$ to reduce bias results in slower and less reliable computation because of higher simulation noise (and
consequently, a larger number of samples to achieve the same accuracy of various simulation-based estimates).
Generally, there is no rule of thumb for selecting $\lambda$, which is usually chosen with some trial and error.

In summary, the preceding discussion suggests that if simulation noise is not an issue (i.e., one can 
afford many simulation samples) one should choose large values of $\lambda$, since then the bias is reduced and
one may afford greater exploration without losing the contraction property of $\Pi T_\mu^{(\lambda)}$. In the contrary case,
however, the degradation of the estimate of $J_\mu$ due to simulation noise may offset whatever bias/contraction
benefits a large value of $\lambda$ may bring.
3.1 TD Methods


Most of the simulation-based methods for solving the projected equation use explicitly or implicitly the 
notion of temporal difference (TD), which originated in reinforcement learning with the works of Samuel 
[Sam59], [Sam67] on a checkers-playing program. The first TD method is TD(λ), which can be viewed as
an iterative stochastic approximation-type algorithm. The LSTD(λ) method is based on batch simulation:
it first generates a batch of state and cost samples, it approximates the projected equation $\Phi r = \Pi T_\mu^{(\lambda)}(\Phi r)$
using these samples, and then solves the equation directly by matrix inversion. Another TD method is
LSPE(λ), which while being more iterative, shares much of the simulation philosophy of LSTD(λ).

To describe more specifically the LSTD(λ) and LSPE(λ) methods, we first note that the orthogonality
condition that characterizes the projection in the projected equation $\Phi r = \Pi T_\mu^{(\lambda)}(\Phi r)$ is

$$\Phi' \Xi \big( \Phi r - T_\mu^{(\lambda)}(\Phi r) \big) = 0, \tag{3.5}$$

where $\Xi$ is the diagonal matrix with the vector $\xi$ along the diagonal (see e.g., [Ber07]). Thus the projected
equation (3.1) is equivalent to the lower-dimensional equation (3.5), which can in turn be written in matrix
form as

$$C^{(\lambda)} r = d^{(\lambda)}, \tag{3.6}$$

with

$$C^{(\lambda)} = \Phi' \Xi (I - \alpha P^{(\lambda)}) \Phi, \qquad d^{(\lambda)} = \Phi' \Xi g^{(\lambda)}, \tag{3.7}$$

and

$$P^{(\lambda)} = (1-\lambda) \sum_{\ell=0}^\infty \lambda^\ell \alpha^\ell P_\mu^{\ell+1}, \qquad g^{(\lambda)} = \sum_{\ell=0}^\infty \lambda^\ell \alpha^\ell P_\mu^\ell g_\mu, \tag{3.8}$$

where $P_\mu$ and $g_\mu$ are the transition probability matrix and expected single-stage cost vector corresponding
to $\mu$. The LSTD(λ) and LSPE(λ) methods use simulation-based approximations of $C^{(\lambda)}$ and $d^{(\lambda)}$. This is
done by simulating a state sequence $(i_0, \ldots, i_t)$ and corresponding transition cost sequence, using the current
policy $\mu$ (perhaps with exploration enhancement, as discussed earlier). Then after each simulated state $i_\ell$,
$\ell = 0, \ldots, t$, is generated, estimates $C_\ell^{(\lambda)}$ and $d_\ell^{(\lambda)}$ are obtained using the simulation samples up to time
$\ell$, using formulas that we will not give here, as they are not important for our purposes. Such formulas,
in various alternative forms, can be found in several sources, including the textbooks cited earlier. The
papers [NeB03], [BeY09], [Yu10a], [Yu10b] discuss the conditions for the convergence $\lim_{\ell\to\infty} C_\ell^{(\lambda)} = C^{(\lambda)}$,
$\lim_{\ell\to\infty} d_\ell^{(\lambda)} = d^{(\lambda)}$ to hold with probability 1.

The LSTD(λ) method is based on simple matrix inversion: after the last state $i_t$ of the simulation
trajectory is generated, it computes the solution

$$\hat r = \big( C_t^{(\lambda)} \big)^{-1} d_t^{(\lambda)} \tag{3.9}$$

of the corresponding simulation-based approximation to Eq. (3.6)

$$C_t^{(\lambda)} r = d_t^{(\lambda)}, \tag{3.10}$$

and approximates the cost vector $J_\mu$ by $\Phi \hat r$. An important point is that $\hat r$ can be obtained regardless of
whether $\Pi T_\mu^{(\lambda)}$ is a contraction. It is only required that $C_t^{(\lambda)}$ is invertible, a much less restrictive condition.
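A minimal sketch of an LSTD(λ) computation along a single trajectory is given below. Since the estimation formulas are not given in the text, the sketch assumes the commonly used eligibility-trace form $z_\ell = \alpha\lambda z_{\ell-1} + \phi(i_\ell)$, with $C_t^{(\lambda)}$ and $d_t^{(\lambda)}$ formed as sample averages of $z_\ell \big(\phi(i_\ell) - \alpha\phi(i_{\ell+1})\big)'$ and $z_\ell\, g(i_\ell, i_{\ell+1})$; this may differ in details from the versions in the cited references. The argument names `states` (visited states, one more entry than transitions) and `costs` (observed transition costs) are hypothetical.

```python
import numpy as np

def lstd_lambda(states, costs, Phi, alpha, lam):
    """Estimate r from one trajectory by solving C_t r = d_t, cf. Eqs. (3.9)-(3.10).

    Assumed estimators (eligibility-trace form):
      z_l = alpha*lam*z_{l-1} + phi(i_l)
      C_t = average of z_l (phi(i_l) - alpha*phi(i_{l+1}))'
      d_t = average of z_l * g(i_l, i_{l+1})
    """
    s = Phi.shape[1]
    C = np.zeros((s, s))
    d = np.zeros(s)
    z = np.zeros(s)
    num_transitions = len(costs)
    for l in range(num_transitions):
        phi, phi_next = Phi[states[l]], Phi[states[l + 1]]
        z = alpha * lam * z + phi                       # eligibility trace
        C += np.outer(z, phi - alpha * phi_next)
        d += z * costs[l]
    return np.linalg.solve(C / num_transitions, d / num_transitions)
```

As a sanity check, for a deterministic chain and lookup-table features (`Phi` equal to the identity) every temporal difference vanishes at $r = J_\mu$, so the computed solution coincides with $J_\mu$ whenever the estimated matrix is invertible.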

One version of the LSPE(λ) method consists of iterative solution of the system (3.10). It approximates
the cost vector $J_\mu$ by $\Phi r_{t+1}$, where $r_{t+1}$ is obtained at the last step of the iteration

$$r_{\ell+1} = r_\ell - \gamma G_\ell \big( C_\ell^{(\lambda)} r_\ell - d_\ell^{(\lambda)} \big), \qquad \ell = 0,\ldots,t, \tag{3.11}$$

where $r_0$ is some initial vector, likely the vector obtained from the preceding policy evaluation, $\gamma$ is a positive
stepsize, $G_\ell$ is the matrix

$$G_\ell = \left( \frac{1}{\ell+1} \sum_{m=0}^{\ell} \phi(i_m)\phi(i_m)' \right)^{-1}, \qquad \ell = 0,\ldots,t, \tag{3.12}$$

and as earlier, $\phi(i)'$ denotes the $i$th row of the matrix $\Phi$. In the original proposal of [BeI96] the stepsize is
$\gamma = 1$; convergence of $\Phi r_t$ to the fixed point of $\Pi T_\mu^{(\lambda)}$ for this stepsize was shown in [BBN04]. The matrix $G_\ell$
is a simulation-based approximation of $(\Phi'\Xi\Phi)^{-1}$ (alternative choices of $G_\ell$ have been discussed recently in
[Ber11b], [Ber11c]). There is also an equivalent implementation of this iteration, which is based on solution
of a least squares problem (see Section 4.1).

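The choice (3.12) for $G_\ell$ and the use of $\gamma = 1$ are based on a view of the method as an approximation
to the projected value iteration method

$$\Phi r_{\ell+1} = \Pi T_\mu^{(\lambda)}(\Phi r_\ell),$$

which after some calculation can be written as

$$\Phi r_{\ell+1} = \Phi r_\ell - \Phi (\Phi'\Xi\Phi)^{-1} \big( C^{(\lambda)} r_\ell - d^{(\lambda)} \big),$$

or equivalently, since $\Phi$ has full rank, as

$$r_{\ell+1} = r_\ell - (\Phi'\Xi\Phi)^{-1} \big( C^{(\lambda)} r_\ell - d^{(\lambda)} \big);$$

cf. Eqs. (3.11)-(3.12) with $\gamma = 1$.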

Note that the matrix inversion in Eq. (3.12) is not so onerous, because it can be formed incrementally,
with a rank-one correction as each sample becomes available. On the other hand, contrary to LSTD(λ) [and
similar to TD(λ)], the LSPE(λ) method (3.11)-(3.12) requires that $\Pi T_\mu^{(\lambda)}$ be a contraction for convergence.
Indeed if the simulation is performed using the steady-state distribution $\xi_\mu$, it can be shown that $\Pi T_\mu^{(\lambda)}$ is a
contraction, but if the simulation is performed using a mixture/off-policy distribution (3.3) for the purpose
of exploration-enhancement, the contraction property may be lost and repeated iterations of the form (3.11)
may diverge.
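The role of the contraction property can be illustrated with the exact (simulation-free) counterpart of iteration (3.11), in which the estimates are replaced by their limits $C^{(\lambda)}$, $d^{(\lambda)}$, and $(\Phi'\Xi\Phi)^{-1}$. The sketch below assumes steady-state weights $\xi$ and uses the closed forms $P^{(\lambda)} = (1-\lambda)P_\mu(I - \lambda\alpha P_\mu)^{-1}$, $g^{(\lambda)} = (I - \lambda\alpha P_\mu)^{-1} g_\mu$ obtained by summing the geometric series; all function names are hypothetical.

```python
import numpy as np

def stationary_distribution(P):
    """Steady-state distribution of an irreducible chain with transition matrix P."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    xi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return xi

def projected_equation_matrices(P, g, Phi, xi, alpha, lam):
    """Exact C, d of the projected equation and the scaling matrix G = (Phi' Xi Phi)^{-1}."""
    n = P.shape[0]
    M = np.linalg.inv(np.eye(n) - lam * alpha * P)
    P_lam = (1.0 - lam) * (P @ M)
    g_lam = M @ g
    Xi = np.diag(xi)
    C = Phi.T @ Xi @ (np.eye(n) - alpha * P_lam) @ Phi
    d = Phi.T @ Xi @ g_lam
    G = np.linalg.inv(Phi.T @ Xi @ Phi)
    return C, d, G

def lspe_iterations(C, d, G, r0, gamma=1.0, num_iters=100):
    """Iterate r <- r - gamma * G (C r - d) and return the list of all iterates."""
    r = np.array(r0, dtype=float)
    history = [r.copy()]
    for _ in range(num_iters):
        r = r - gamma * (G @ (C @ r - d))
        history.append(r.copy())
    return history
```

With `gamma=1.0` each step is exactly one projected value iteration, so under the steady-state weights the error $\|\Phi r_\ell - \Phi r_\mu(\lambda)\|_\xi$ shrinks by at least the factor $\alpha(1-\lambda)/(1-\lambda\alpha)$ per iteration; replacing `xi` by a mixture distribution removes this guarantee.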

We finally note that in iteration (3.11) the underlying assumption is that we update r as simulation 
samples are collected and used to form ever improving approximations to C and d. An alternative is to 
use batch simulation, like in LSTD: first simulate to obtain $C_t^{(\lambda)}$, $d_t^{(\lambda)}$, and $G_t$, and then solve the system
$C_t^{(\lambda)} r = d_t^{(\lambda)}$ iteratively rather than through the direct matrix inversion (3.9), by using any number of
iterations of the type (3.11). In fact, we may use only one iteration, in which case the method takes the
form

$$r_1 = r_0 - \gamma G_t \big( C_t^{(\lambda)} r_0 - d_t^{(\lambda)} \big). \tag{3.13}$$

A single (or very few) iterations may be sufficient if $\lambda$ is close to 1, since then the contraction modulus of
$\Pi T_\mu^{(\lambda)}$ is close to 0 (see e.g., [BeT96], Lemma 6.6, or [Ber07], Prop. 6.3.3), so a single iteration with $\Pi T_\mu^{(\lambda)}$ is
very effective, yielding a vector that is close to its fixed point. We will return to this variant of the method
later.

3.2 Comparison of LSTD(λ) and LSPE(λ)

There has been speculation about the relative merits of LSTD(λ) and LSPE(λ). Generally speaking, it is
difficult to reach definitive conclusions, as there are several complex factors to consider, such as the length
of the simulation sequence $(i_0, \ldots, i_t)$, and the potential near-singularity of $C^{(\lambda)}$, which affects the error in
the matrix inversion in the LSTD(λ) formula (3.9). As an illustration, consider a few different situations:

(a) Assume, as an idealization, that an infinite number of samples is collected. Then both methods yield 
in the limit the same result, the fixed point of the projected equation J = flT^^A However, in contrast 
to LSTD(A), in order to guarantee convergence, LSPE(A) requires that HT^^ is a contraction, which 
interferes with the freedom to do exploration, as discussed earlier. 

(b) Assume that $C^{(\lambda)}$ is invertible, but is nearly singular. Then the matrix inversion in the LSTD(λ)
formula (3.9) may require a very large number of samples to yield a reasonably accurate solution of
$C^{(\lambda)} r = d^{(\lambda)}$.† To correct the sensitivity of LSTD(λ) to simulation noise, it may be necessary to turn
it into an iterative method through some form of regularization, which then brings it close to a form of
LSPE(λ) (see [Ber09], [WPB09], [Ber11a], [Ber11b], [Ber11c] for such regularization methods and their
connection to LSPE). Of course, the situation becomes even more complex if $C^{(\lambda)}$ is singular, perhaps
due to inadvertent rank deficiency of $\Phi$ (see [WaB11a], [WaB11b] for a discussion of this possibility).

(c) When LSTD(λ) and LSPE(λ) are embedded within a PI framework, the number of samples collected
using any one policy is often relatively small. Then the behavior of the two methods becomes very
complicated, and it is hard to reach any kind of reliable conclusion [Ber10]. Computational studies
indicate that LSPE(λ), being an iterative method, is less sensitive to the matrix inversion errors that
afflict LSTD(λ) in the presence of high simulation noise.

† It is well-known from fundamental error analyses of linear equation solvers that small errors in a nearly singular
matrix $C^{(\lambda)}$ will cause large errors in the solution of $C^{(\lambda)} r = d^{(\lambda)}$. Near-singularity of $C^{(\lambda)}$ may be due either to
the columns of $\Phi$ being nearly linearly dependent or to the matrix $\Xi(I - \alpha P^{(\lambda)})$ being nearly singular [cf. Eq. (3.7)].
Near-linear dependence of the columns of $\Phi$ will not affect the error in the solution of the high-dimensional projected
equation, which can be written as $\Phi r = \Pi T^{(\lambda)}(\Phi r)$. The reason is that this error depends only on the subspace $S$ and
not its representation in terms of the matrix $\Phi$. In particular, if we replace $\Phi$ with a matrix $\Phi B$, where $B$ is an $s \times s$
invertible scaling matrix, the subspace $S$ will be unaffected and the error in the solution of the projected equation
will also be unaffected. On the other hand, near singularity of the matrix $I - \alpha P^{(\lambda)}$ may affect significantly the error.
Note that $I - \alpha P^{(\lambda)}$ is nearly singular in the case where $\alpha$ is very close to 1, or in the corresponding undiscounted
case where $\alpha = 1$ and $P$ is substochastic with some eigenvalues very close to 1. Large variations in the size of the
diagonal components of $\Xi$ may also affect significantly the error, although this dependence is complicated by the fact
that $\Xi$ appears not only in the formula $C^{(\lambda)} = \Phi'\Xi(I - \alpha P^{(\lambda)})\Phi$ but also in the formula $d^{(\lambda)} = \Phi'\Xi g^{(\lambda)}$.

The preceding discussion is also relevant to the implementations of λ-PI to be discussed in the next section,
since these implementations bear strong relations to both LSTD(λ) and LSPE(λ).

4. LAMBDA-POLICY ITERATION WITH COST FUNCTION APPROXIMATION

We saw in Section 2 that the policy evaluation portion of λ-PI,

$$J_{k+1} = T^{(\lambda)}_{\mu_{k+1}} J_k, \tag{4.1}$$

[cf. Eq. (1.5)] can be implemented in two ways:

(1) By computing $T^{(\lambda)}_{\mu_{k+1}} J_k$.

(2) By finding the fixed point of the mapping $W_k$ [cf. Eq. (2.1)] through solution of the equation

$$J = W_k J, \tag{4.2}$$

which can be viewed as Bellman's equation associated with the current policy for the two equivalent
DP problems discussed in Section 2 [cf. Eq. (2.2)]: a $\lambda\alpha$-discounted problem and a stopping problem.

Let us now consider approximation of λ-PI on the subspace $S = \{\Phi r \mid r \in \Re^s\}$. A natural possibility
is to introduce projection in the preceding approaches. In particular, we may approximate the λ-PI iterate
$J_{k+1}$ of Eq. (4.1) by $\Phi r_{k+1}$ in three ways:

(a) By using a single projected value iteration for the original $\alpha$-discounted problem,

$$\Phi r_{k+1} = \Pi T^{(\lambda)}_{\mu_{k+1}}(\Phi r_k). \tag{4.3}$$

This is the original proposal of [BeI96]. It is the variant of the LSPE(λ) method (3.11)-(3.12), which
involves just the last iteration.

(b) By solving a projected version of Eq. (4.2), viewing it as Bellman's equation for the $\lambda\alpha$-discounted
problem of Section 2, and setting $r_{k+1}$ equal to its solution. This is the proposal of [ThS10a], and
implements by simulation the solution of this projected equation, essentially by applying LSTD(0) to
Bellman's equation for the $\lambda\alpha$-discounted problem formulated in Section 2.

(c) By solving a projected version of Eq. (4.2), viewing it as Bellman's equation for the stopping problem
formulated in Section 2, and setting $r_{k+1}$ equal to its solution.

In the following three subsections, we will describe three alternative implementations of λ-PI corresponding
to the possibilities (a)-(c) above. Of course when linear cost function approximation of the form
$\Phi r_k$ is used to represent $J_k$, the λ-PI method need not converge, and the cost vectors of the generated
policies typically oscillate within some suboptimality threshold from $J^*$. We do not address this issue, but
we note that related error bounds, which also apply to other forms of optimistic PI, are given by Bertsekas
and Yu [BeY10a], Thiery and Scherrer [ThS10b], and Scherrer [Sch11].

4.1 The LSPE(λ) Implementation

A variant of the LSPE(λ) method (3.11)-(3.12) is to form batches of simulation samples and perform iteration
(3.11) at the end of each batch. In an extreme case, we treat the entire simulation trajectory $(i_0, \ldots, i_t)$ as
a single simulation batch, and we perform a single iteration (3.11), for $\ell = t$, yielding the method

$$r_{k+1} = r_k - G_t\big(C_t^{(\lambda)} r_k - d_t^{(\lambda)}\big), \tag{4.4}$$

where $\Phi r_k$ is the approximate evaluation of the cost vector of the preceding policy $\mu_k$ [cf. Eq. (3.13)]. As
$t \to \infty$ and the simulation becomes exact in the limit, i.e.,

$$\lim_{t\to\infty} C_t^{(\lambda)} = C^{(\lambda)}, \qquad \lim_{t\to\infty} d_t^{(\lambda)} = d^{(\lambda)},$$

and if $G_t$ is given by the formula (3.12), it can be verified that

$$\Phi r_{k+1} \to \Pi T^{(\lambda)}_{\mu_{k+1}}(\Phi r_k). \tag{4.5}$$

Thus the method (4.4) with $G_t$ given by Eq. (3.12) can be viewed as a simulation-based implementation of
Eq. (4.3), the projected version of λ-PI, which becomes exact in the limit as $t \to \infty$. In practice of course
$t$ is finite, and one may consider variants of the method, whereby multiple iterations of the form (4.4) are
performed, with each iteration using additional simulation samples.

We note a mathematically equivalent description of this method, which is given in terms of a least-squares
optimization (see [Ber07], Section 6.3.3 for a more detailed textbook account): we set

$$r_{k+1} = \arg\min_{r} \sum_{\ell=0}^{t} \Big( \phi(i_\ell)'r - \phi(i_\ell)'r_k - \sum_{m=\ell}^{t} (\lambda\alpha)^{m-\ell}\, q(i_m, i_{m+1}) \Big)^2, \tag{4.6}$$

where $q(i_m, i_{m+1})$ is the temporal difference

$$q(i_m, i_{m+1}) = g\big(i_m, \mu_{k+1}(i_m), i_{m+1}\big) + \alpha\phi(i_{m+1})'r_k - \phi(i_m)'r_k, \qquad m = 0, \ldots, t. \tag{4.7}$$

In fact this is how the method was originally described in [BeI96] and [BeT96].

A positive aspect of this method is that it approximates directly $\Pi T^{(\lambda)}_{\mu_{k+1}}(\Phi r_k)$, so it is not subject to
bias in the evaluation of the fixed point of $W_k$; cf. Eq. (4.5). However, in the form given here, the method
does not address the issue of exploration. Despite this fact, this implementation [in the form (4.6)] has been
successful in several challenging computational studies, including the one involving the game of tetris in the
original paper [BeI96] and some followup works, and a recent one by Foderaro et al. [FRF11] involving the
game of pac-man, a benchmark problem of pursuit-evasion.

4.2 λ-PI(0) - An Implementation Based on a Discounted MDP

This implementation, suggested and tested by Thiery and Scherrer [ThS10a], [ThS10b], is based on the fixed
point property of $J_{k+1}$ [cf. Prop. 2.1(b)]. It produces an approximation $\Phi r_{k+1}$ to $J_{k+1}$ within the subspace
$S$, by solving the projected equation

$$\Phi r = \Pi W_k(\Phi r),$$

with $W_k$ given by

$$W_k J = (1-\lambda)\, T_{\mu_{k+1}}(\Phi r_k) + \lambda\, T_{\mu_{k+1}} J. \tag{4.8}$$

We may find the solution $r_{k+1}$ of this equation by using an LSTD(0)-like simulation approach. In particular,
$r_{k+1}$ satisfies the orthogonality condition

$$C r = d(k),$$

where

$$C = \Phi'\Xi\,(I - \lambda\alpha P_{\mu_{k+1}})\,\Phi, \qquad d(k) = \Phi'\Xi\,\big(g_{\mu_{k+1}} + (1-\lambda)\alpha P_{\mu_{k+1}} \Phi r_k\big),$$

so that

$$r_{k+1} = C^{-1} d(k). \tag{4.9}$$

We refer to this method as λ-PI(0) to distinguish it notationally from the method of the next subsection
(the name LSλPI was introduced for this method in [ThS10a]).

In a simulation-based implementation, the matrix $C$ and the vector $d(k)$ are approximated by estimates
$C_t$ and $d_t(k)$. Thus this method does not require that $\Pi T^{(\lambda)}_{\mu_{k+1}}$ is a contraction, and like LSTD, it can deal well
with the issue of exploration. The simulation samples need not depend on the policy $\mu_{k+1}$ being evaluated, so
they can be generated only once within a PI process. On the other hand, the objective of the implementation
is to approximate the next iterate of λ-PI, i.e., $\Pi T^{(\lambda)}_{\mu_{k+1}}(\Phi r_k)$, and it is not clear that it is doing this well. To
see this, suppose that the iteration (4.9), or equivalently $\Phi r_{k+1} = \Pi W_k(\Phi r_{k+1})$, is repeated an infinite number
of times so it converges to a limit $\bar r$, which must satisfy $\Phi \bar r = \Pi W_k(\Phi \bar r)$ with $r_k = \bar r$ in the definition of $W_k$.
Then using Eq. (4.8), we have

$$\Phi \bar r = (1-\lambda)\,\Pi T_{\mu_{k+1}}(\Phi \bar r) + \lambda\,\Pi T_{\mu_{k+1}}(\Phi \bar r),$$

which shows that $\Phi \bar r = \Pi T_{\mu_{k+1}}(\Phi \bar r)$. Thus λ-PI(0) aims at $\bar r$, which is the limit of TD(0) independent of
the value of $\lambda$. Indeed as $\lambda \to 1$, $\Pi W_k$ tends to $\Pi T_{\mu_{k+1}}$ [cf. Eq. (4.8)], so its fixed point $\Phi r_{k+1}$ tends to the
fixed point of $\Pi T_{\mu_{k+1}}$, i.e., the limit of TD(0). It follows that while this implementation deals well with the
issue of exploration, it may be subject to significant bias-related error.
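
The bias argument above can be checked numerically on a toy problem. The sketch below is an illustration only: it assumes exact (not simulated) quantities, a single scalar feature so that $C$ and $d(k)$ are scalars, a fixed policy with transition matrix `P` and expected stage costs `g`, and hypothetical function names. Repeating the exact form of iteration (4.9) with the policy held fixed, the limit coincides with the TD(0) solution for every $\lambda$ tried.

```python
from typing import List, Sequence


def expected_next_feature(P: Sequence[Sequence[float]],
                          phi: Sequence[float]) -> List[float]:
    """(P Phi)(i) = sum_j p_ij * phi(j) for a scalar feature phi."""
    return [sum(p * f for p, f in zip(row, phi)) for row in P]


def lambda_pi0_step(r_k: float, lam: float, alpha: float,
                    P: Sequence[Sequence[float]], g: Sequence[float],
                    phi: Sequence[float], xi: Sequence[float]) -> float:
    """One exact step r_{k+1} = C^{-1} d(k) of Eq. (4.9), scalar feature."""
    p_phi = expected_next_feature(P, phi)
    # C = Phi' Xi (I - lam * alpha * P) Phi
    c = sum(w * f * (f - lam * alpha * pf)
            for w, f, pf in zip(xi, phi, p_phi))
    # d(k) = Phi' Xi (g + (1 - lam) * alpha * P Phi r_k)
    d = sum(w * f * (gi + (1.0 - lam) * alpha * pf * r_k)
            for w, f, gi, pf in zip(xi, phi, g, p_phi))
    return d / c


def td0_solution(alpha: float, P: Sequence[Sequence[float]],
                 g: Sequence[float], phi: Sequence[float],
                 xi: Sequence[float]) -> float:
    """Solve Phi' Xi (I - alpha * P) Phi r = Phi' Xi g for a scalar feature."""
    p_phi = expected_next_feature(P, phi)
    c0 = sum(w * f * (f - alpha * pf) for w, f, pf in zip(xi, phi, p_phi))
    d0 = sum(w * f * gi for w, f, gi in zip(xi, phi, g))
    return d0 / c0


def iterate_to_limit(lam: float, alpha: float, P, g, phi, xi,
                     r0: float = 0.0, iters: int = 500) -> float:
    """Repeat the step with the policy held fixed, as in the argument above."""
    r = r0
    for _ in range(iters):
        r = lambda_pi0_step(r, lam, alpha, P, g, phi, xi)
    return r


# Hypothetical two-state example (all numbers are illustrative).
P = [[0.5, 0.5], [0.5, 0.5]]
g = [1.0, 3.0]
phi = [1.0, 2.0]
xi = [0.5, 0.5]
alpha = 0.9
limits = [iterate_to_limit(lam, alpha, P, g, phi, xi) for lam in (0.0, 0.5, 0.9)]
```

In this example all entries of `limits` agree with `td0_solution(alpha, P, g, phi, xi)`; only the speed of convergence depends on $\lambda$.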

4.3 λ-PI(1) - An Implementation Based on a Stopping Problem

The third implementation is based on the property mentioned in Section 2: the fixed point equation $J = W_k J$
[or equivalently, Eq. (2.2)] is Bellman's equation for the policy $\mu_{k+1}$ in the context of a stopping problem.
Here there is an artificial termination state 0, and for all states $j$, there is probability $1 - \lambda$ that a transition
to $j$ will be followed by an immediate transition to state 0, with cost $\alpha J_k(j)$, cf. Eq. (2.2). Note that if $\lambda$
is not too close to 1, the trajectories of this problem tend to be short, and in fact if $\lambda = 0$ all trajectories
consist of a single transition.

To compute an approximation $\Phi r_{k+1}$ to the fixed point of $W_k$ by using the stopping problem, we may
use any policy evaluation algorithm with cost function approximation over the subspace $S = \{\Phi r \mid r \in \Re^s\}$.
An interesting choice is to use the LSPE(1) method, which consists of a least squares fit of $\Phi r$ to the
simulated costs of the trajectories of the stopping problem whose Bellman equation mapping is $W_k$. The use
of LSPE(1) not only involves minimum bias relative to all LSPE($\nu$) methods with $\nu \in [0,1]$, but also leads
to a simple least squares implementation.

To this end, we introduce a simulation procedure, called geometric sampling, which departs from the
single infinitely long simulation trajectory format of the implementation of Section 4.1, and has the following
characteristics:

(a) It uses multiple relatively short simulation trajectories.

(b) The initial state of each trajectory is chosen essentially as desired, thereby allowing flexibility to generate
a richer mixture of state visits.

(c) The length of each trajectory is random and is determined by a $\lambda$-dependent geometric distribution [a
probability $(1-\lambda)\lambda^\ell$ that the number of transitions is $\ell + 1$].

In particular, given the current representation $\Phi r_k$ of $J_k$ and the current policy $\mu_{k+1}$, we update
the parameter vector from $r_k$ to $r_{k+1}$ after generating $t$ simulated trajectories. The states of a trajectory
are generated according to the transition probabilities $p_{ij}(\mu_{k+1}(i))$, the transition cost is discounted by an
additional factor $\alpha$ with each transition, and following each transition to a state $j$, the trajectory is terminated
with probability $1 - \lambda$ and with an extra cost $\alpha\phi(j)'r_k$. Once a trajectory is terminated, an initial state for
the next trajectory is chosen according to a fixed probability distribution $\zeta_0 = (\zeta_0(1), \ldots, \zeta_0(n))$, and the
process is repeated. Note that the sequence of restart states need not depend on the policy being evaluated,
so that it can be simulated only once within a PI process. Of course, the simulated trajectories have to be
recalculated for each new policy. The details are as follows.

Let the $m$th trajectory, $m = 1, \ldots, t$, have the form $(i_{0,m}, i_{1,m}, \ldots, i_{N_m,m})$, where $i_{0,m}$ is the initial
state, and $i_{N_m,m}$ is the state at which the trajectory is completed (the last state prior to termination). For
each state $i_{\ell,m}$, $\ell = 0, \ldots, N_m - 1$, of the $m$th trajectory, the simulated cost is

$$c_{\ell,m}(r_k) = \alpha^{N_m-\ell}\,\phi(i_{N_m,m})'r_k + \sum_{q=\ell}^{N_m-1} \alpha^{q-\ell}\, g\big(i_{q,m}, \mu_{k+1}(i_{q,m}), i_{q+1,m}\big). \tag{4.10}$$

Once the costs $c_{\ell,m}(r_k)$ are computed for all states of the $m$th trajectory and all trajectories $m = 1, \ldots, t$,
the vector $r_{k+1}$ is obtained by a least squares fit of these costs:

$$r_{k+1} = \arg\min_{r \in \Re^s} \sum_{m=1}^{t} \sum_{\ell=0}^{N_m-1} \big(\phi(i_{\ell,m})'r - c_{\ell,m}(r_k)\big)^2, \tag{4.11}$$

cf. Eqs. (4.6)-(4.7). Equivalently, we can write the solution of the least squares problem explicitly as

$$r_{k+1} = \Big(\sum_{m=1}^{t} \sum_{\ell=0}^{N_m-1} \phi(i_{\ell,m})\phi(i_{\ell,m})'\Big)^{-1} \sum_{m=1}^{t} \sum_{\ell=0}^{N_m-1} \phi(i_{\ell,m})\, c_{\ell,m}(r_k). \tag{4.12}$$

We refer to the resulting implementation as λ-PI(1).
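
The geometric sampling procedure and the update (4.10)-(4.12) can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it assumes a scalar feature (so the matrix inverse in Eq. (4.12) reduces to a division), transition costs stored as a table `g[i][j]` under the policy being evaluated, and hypothetical function names.

```python
import random
from typing import List, Sequence


def sample_trajectory(rng: random.Random, P: Sequence[Sequence[float]],
                      restart: Sequence[float], lam: float) -> List[int]:
    """Return states (i_0, ..., i_N) of one geometrically sampled trajectory.

    After every transition the trajectory stops with probability 1 - lam,
    so the number of transitions N is l + 1 with probability (1 - lam) * lam**l.
    """
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    index = range(len(P))
    state = rng.choices(index, weights=restart)[0]  # restart distribution zeta_0
    states = [state]
    while True:
        state = rng.choices(index, weights=P[state])[0]
        states.append(state)
        if rng.random() >= lam:  # terminate with probability 1 - lam
            return states


def cost_samples(states: Sequence[int], r_k: float, alpha: float,
                 phi: Sequence[float],
                 g: Sequence[Sequence[float]]) -> List[float]:
    """Cost samples c_l(r_k), l = 0, ..., N - 1, of Eq. (4.10), scalar feature."""
    n_m = len(states) - 1
    tail = phi[states[n_m]] * r_k  # terminal cost phi(i_N)' r_k
    costs = [0.0] * n_m
    for l in range(n_m - 1, -1, -1):
        # Backward recursion: c_l = g(i_l, i_{l+1}) + alpha * c_{l+1}
        tail = g[states[l]][states[l + 1]] + alpha * tail
        costs[l] = tail
    return costs


def lambda_pi1_update(trajectories: Sequence[Sequence[int]], r_k: float,
                      alpha: float, phi: Sequence[float],
                      g: Sequence[Sequence[float]]) -> float:
    """Least squares fit of Eq. (4.12) for a scalar feature."""
    num = 0.0
    den = 0.0
    for states in trajectories:
        for l, c in enumerate(cost_samples(states, r_k, alpha, phi, g)):
            f = phi[states[l]]
            num += f * c
            den += f * f
    return num / den
```

With `lam = 0.0` every call to `sample_trajectory` returns exactly two states (a single transition), matching the special case $\lambda = 0$. For a feature vector of dimension $s > 1$, the division in `lambda_pi1_update` would be replaced by the solution of the $s \times s$ linear system in Eq. (4.12).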

Note the extreme special case when $\lambda = 0$. Then all the simulated trajectories consist of a single
transition, and there is a restart at every transition. This means that the simulation samples are from states
that are generated independently according to the restart distribution $\zeta_0$.

Convergence of the Simulation Process

We will now show that in the limit, as $t \to \infty$, the vector $r_{k+1}$ of Eq. (4.12) satisfies

$$\Phi r_{k+1} = \Pi T^{(\lambda)}_{\mu_{k+1}}(\Phi r_k), \tag{4.13}$$

where $\Pi$ denotes projection with respect to the weighted Euclidean norm $\|\cdot\|_\zeta$ with weight vector
$\zeta = (\zeta(1), \ldots, \zeta(n))$, where

$$\zeta(i) = \frac{\tilde\zeta(i)}{\sum_{j=1}^{n} \tilde\zeta(j)}, \tag{4.14}$$

and

$$\tilde\zeta(i) = \sum_{\ell=0}^{\infty} \zeta_\ell(i),$$

with $\zeta_\ell(i)$ being the probability of the state being $i$ after $\ell$ transitions of a randomly chosen simulation
trajectory. This is the underlying norm in TD methods such as LSTD, LSPE, and TD, as applied to SSP
problems (see [BeT96], Section 6.3.4). Note that $\zeta(i)$ is the long-term occupancy probability of state $i$
during the simulation process. We assume that the restart distribution $\zeta_0$ is chosen so that $\zeta(i) > 0$ for all
$i = 1, \ldots, n$, implying that $\|\cdot\|_\zeta$ is a legitimate norm [this is guaranteed if we require that $\zeta_0(i) > 0$ for all
$i$].

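Indeed, let us view $T^{\ell+1}_{\mu_{k+1}} J$ as the vector of total discounted costs over a horizon of $(\ell + 1)$ stages with
the terminal cost function being $J$, and write

$$T^{\ell+1}_{\mu_{k+1}} J = \alpha^{\ell+1} P^{\ell+1}_{\mu_{k+1}} J + \sum_{q=0}^{\ell} \alpha^{q} P^{q}_{\mu_{k+1}} g_{\mu_{k+1}},$$

where $P_{\mu_{k+1}}$ and $g_{\mu_{k+1}}$ are the transition probability matrix and cost vector, respectively, under $\mu_{k+1}$. As
a result the vector $T^{(\lambda)}_{\mu_{k+1}} J = (1-\lambda) \sum_{\ell=0}^{\infty} \lambda^\ell\, T^{\ell+1}_{\mu_{k+1}} J$ can be expressed as

$$\big(T^{(\lambda)}_{\mu_{k+1}} J\big)(i) = \sum_{\ell=0}^{\infty} (1-\lambda)\lambda^\ell\, E\Big\{ \alpha^{\ell+1} J(i_{\ell+1}) + \sum_{q=0}^{\ell} \alpha^{q}\, g\big(i_q, \mu_{k+1}(i_q), i_{q+1}\big) \,\Big|\, i_0 = i \Big\}, \qquad i = 1, \ldots, n.$$

Thus $\big(T^{(\lambda)}_{\mu_{k+1}} J\big)(i)$ may be viewed as the expected value of the $(\ell + 1)$-stage cost of policy $\mu_{k+1}$ starting at
state $i$, with the number of stages being random and geometrically distributed with parameter $\lambda$ [probability
of $\kappa + 1$ transitions is $(1-\lambda)\lambda^\kappa$, $\kappa = 0, 1, \ldots$]. It follows that the cost samples $c_{\ell,m}(r_k)$ of Eq. (4.10), produced
by the simulation process described earlier, can be used to estimate $\big(T^{(\lambda)}_{\mu_{k+1}}(\Phi r_k)\big)(i)$ for all $i$ by Monte Carlo
averaging. The estimation formula is

$$D_t(i) = \frac{1}{\sum_{m=1}^{t} \sum_{\ell=0}^{N_m-1} \delta(i_{\ell,m} = i)} \sum_{m=1}^{t} \sum_{\ell=0}^{N_m-1} \delta(i_{\ell,m} = i)\, c_{\ell,m}(r_k), \tag{4.15}$$

where $\delta(i_{\ell,m} = i) = 1$ if $i_{\ell,m} = i$ and $\delta(i_{\ell,m} = i) = 0$ otherwise, and we have

$$\big(T^{(\lambda)}_{\mu_{k+1}}(\Phi r_k)\big)(i) = \lim_{t\to\infty} D_t(i), \qquad i = 1, \ldots, n$$

(see also the discussion on the consistency of Monte Carlo simulation for policy evaluation in [BeT96], Section
5.2).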
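
The consistency claim can be illustrated on a toy chain by comparing the Monte Carlo average of Eq. (4.15) with a truncated evaluation of the series $(1-\lambda)\sum_\ell \lambda^\ell T^{\ell+1}_{\mu_{k+1}} J$. The sketch below is hedged accordingly: the function names are hypothetical, a generic terminal cost vector `J` plays the role of $\Phi r_k$, costs are given as a table `g[i][j]` for a fixed policy, and the restart distribution is assumed to make every state appear at least once among the samples.

```python
import random
from typing import List, Sequence


def exact_t_lambda(J: Sequence[float], lam: float, alpha: float,
                   P: Sequence[Sequence[float]],
                   g: Sequence[Sequence[float]], terms: int = 200) -> List[float]:
    """Truncated series (1 - lam) * sum_l lam**l * T^{l+1} J."""
    n = len(J)
    g_bar = [sum(P[i][j] * g[i][j] for j in range(n)) for i in range(n)]
    result = [0.0] * n
    x = list(J)
    weight = 1.0 - lam
    for _ in range(terms):
        # x becomes T^{l+1} J, with (T x)(i) = g_bar(i) + alpha * sum_j p_ij x(j)
        x = [g_bar[i] + alpha * sum(P[i][j] * x[j] for j in range(n))
             for i in range(n)]
        result = [result[i] + weight * x[i] for i in range(n)]
        weight *= lam
    return result


def monte_carlo_t_lambda(J: Sequence[float], lam: float, alpha: float,
                         P: Sequence[Sequence[float]],
                         g: Sequence[Sequence[float]],
                         restart: Sequence[float], num_traj: int,
                         seed: int = 0) -> List[float]:
    """Estimate D_t(i) of Eq. (4.15) by geometric sampling (requires lam < 1)."""
    rng = random.Random(seed)
    n = len(J)
    index = range(n)
    total = [0.0] * n
    count = [0] * n
    for _ in range(num_traj):
        state = rng.choices(index, weights=restart)[0]
        states = [state]
        while True:
            state = rng.choices(index, weights=P[state])[0]
            states.append(state)
            if rng.random() >= lam:  # stop with probability 1 - lam
                break
        tail = J[states[-1]]  # terminal cost at the last state
        for l in range(len(states) - 2, -1, -1):
            tail = g[states[l]][states[l + 1]] + alpha * tail  # c_{l,m}
            total[states[l]] += tail
            count[states[l]] += 1
    return [total[i] / count[i] for i in range(n)]
```

On small examples the two results agree up to sampling error, which shrinks as the number of trajectories grows.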


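Let us now compare the λ-PI iteration (4.13) with the simulation-based implementation (4.12). Using
the definition of projection, Eq. (4.13) can be written as

$$r_{k+1} = \arg\min_{r} \sum_{i=1}^{n} \zeta(i) \Big(\phi(i)'r - \big(T^{(\lambda)}_{\mu_{k+1}}(\Phi r_k)\big)(i)\Big)^2,$$

or equivalently

$$r_{k+1} = \Big(\sum_{i=1}^{n} \zeta(i)\phi(i)\phi(i)'\Big)^{-1} \sum_{i=1}^{n} \zeta(i)\phi(i)\big(T^{(\lambda)}_{\mu_{k+1}}(\Phi r_k)\big)(i). \tag{4.16}$$

Let $\hat\zeta_t(i)$ be the empirical relative frequency of state $i$ during the simulation, given by

$$\hat\zeta_t(i) = \frac{1}{N_1 + \cdots + N_t} \sum_{m=1}^{t} \sum_{\ell=0}^{N_m-1} \delta(i_{\ell,m} = i). \tag{4.17}$$

Then the simulation-based estimate (4.12) can be written as

$$\begin{aligned}
r_{k+1} &= \Big(\sum_{m=1}^{t} \sum_{\ell=0}^{N_m-1} \phi(i_{\ell,m})\phi(i_{\ell,m})'\Big)^{-1} \sum_{m=1}^{t} \sum_{\ell=0}^{N_m-1} \phi(i_{\ell,m})\, c_{\ell,m}(r_k) \\
&= \Big(\sum_{i=1}^{n} \sum_{m=1}^{t} \sum_{\ell=0}^{N_m-1} \delta(i_{\ell,m} = i)\,\phi(i)\phi(i)'\Big)^{-1} \sum_{i=1}^{n} \sum_{m=1}^{t} \sum_{\ell=0}^{N_m-1} \delta(i_{\ell,m} = i)\,\phi(i)\, c_{\ell,m}(r_k) \\
&= \Big(\sum_{i=1}^{n} \hat\zeta_t(i)\phi(i)\phi(i)'\Big)^{-1} \sum_{i=1}^{n} \frac{1}{N_1 + \cdots + N_t}\,\phi(i) \sum_{m=1}^{t} \sum_{\ell=0}^{N_m-1} \delta(i_{\ell,m} = i)\, c_{\ell,m}(r_k) \\
&= \Big(\sum_{i=1}^{n} \hat\zeta_t(i)\phi(i)\phi(i)'\Big)^{-1} \sum_{i=1}^{n} \frac{\sum_{m=1}^{t} \sum_{\ell=0}^{N_m-1} \delta(i_{\ell,m} = i)}{N_1 + \cdots + N_t}\,\phi(i) \cdot \frac{1}{\sum_{m=1}^{t} \sum_{\ell=0}^{N_m-1} \delta(i_{\ell,m} = i)} \sum_{m=1}^{t} \sum_{\ell=0}^{N_m-1} \delta(i_{\ell,m} = i)\, c_{\ell,m}(r_k),
\end{aligned}$$

and finally, using Eqs. (4.15) and (4.17),

$$r_{k+1} = \Big(\sum_{i=1}^{n} \hat\zeta_t(i)\phi(i)\phi(i)'\Big)^{-1} \sum_{i=1}^{n} \hat\zeta_t(i)\phi(i) D_t(i). \tag{4.18}$$

We can now compare the λ-PI iteration (4.16) and the simulation-based implementation (4.18). Since
$\big(T^{(\lambda)}_{\mu_{k+1}}(\Phi r_k)\big)(i) = \lim_{t\to\infty} D_t(i)$ and $\zeta(i) = \lim_{t\to\infty} \hat\zeta_t(i)$, we see that these two iterations asymptotically
coincide.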
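
The algebra leading from the sample form (4.12) to the aggregated form (4.18) can be verified on a handful of cost samples. The fragment below is a small check under simplifying assumptions: a scalar feature, samples given directly as (state, cost) pairs, and hypothetical function names. The relative frequencies and the per-state averages $D_t(i)$ are computed explicitly.

```python
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[int, float]  # (state i_{l,m}, cost sample c_{l,m}(r_k))


def sample_form(samples: Sequence[Sample], phi: Sequence[float]) -> float:
    """Eq. (4.12) for a scalar feature: sum(phi * c) / sum(phi ** 2)."""
    num = sum(phi[i] * c for i, c in samples)
    den = sum(phi[i] ** 2 for i, _ in samples)
    return num / den


def aggregated_form(samples: Sequence[Sample], phi: Sequence[float]) -> float:
    """Eq. (4.18): weights zeta_hat(i) applied to per-state averages D_t(i)."""
    total = len(samples)  # equals N_1 + ... + N_t
    count: Dict[int, int] = {}
    cost_sum: Dict[int, float] = {}
    for i, c in samples:
        count[i] = count.get(i, 0) + 1
        cost_sum[i] = cost_sum.get(i, 0.0) + c
    zeta_hat = {i: count[i] / total for i in count}   # Eq. (4.17)
    d_t = {i: cost_sum[i] / count[i] for i in count}  # Eq. (4.15)
    num = sum(zeta_hat[i] * phi[i] * d_t[i] for i in count)
    den = sum(zeta_hat[i] * phi[i] ** 2 for i in count)
    return num / den


demo_samples: List[Sample] = [(0, 3.0), (1, 4.0), (0, 5.0)]
demo_phi = [1.0, 2.0]
```

For `demo_samples` both functions return $8/3$ up to rounding, as the derivation predicts; states that are never visited simply drop out of both sums.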

The expression (4.18) provides some insight on how λ-PI(1) approximates the λ-PI iteration (4.16) [or
equivalently $\Phi r_{k+1} = \Pi T^{(\lambda)}_{\mu_{k+1}}(\Phi r_k)$; cf. Eq. (4.13)]. Generally the simulation process of λ-PI(1) (many short
trajectories) involves more noise than the simulation process of the other implementations (a single long
trajectory), because the length of each simulation trajectory is random (geometrically distributed). This
can be seen from iteration (4.18), which involves considerable simulation noise due to the presence of $\hat\zeta_t(i)$
and $D_t(i)$. However, we will argue that from a practical point of view much of this noise does not play a
significant role. To see this, first note that the deviation of $\hat\zeta_t(i)$ from $\zeta(i)$ is not important since $\hat\zeta_t(i)$ simply
redefines the projection norm. Next note that $D_t(i)$ can be written as

$$D_t(i) = \sum_{\ell=0}^{\infty} f_\ell(i) E_\ell(i), \tag{4.19}$$

where $f_\ell(i)$ and $E_\ell(i)$ are the following empirical averages over the entire simulation process:

(a) $f_\ell(i)$ is the empirical relative frequency of cost samples that start at state $i$, and correspond to
trajectories consisting of $\ell + 1$ transitions. As $t \to \infty$ it converges to $(1-\lambda)\lambda^\ell$ based on the way the
simulation is structured.

(b) $E_\ell(i)$ is the Monte Carlo estimate of the cost of trajectories that start at state $i$, consist of $\ell + 1$
transitions, and have terminal cost vector $\Phi r_k$. As $t \to \infty$ it converges to $T^{\ell+1}_{\mu_{k+1}}(\Phi r_k)(i)$.

While both $f_\ell(i)$ and $E_\ell(i)$ contribute to the variance of $D_t(i)$, only $E_\ell(i)$ has practical significance. To see
this note that based on Eq. (4.19), $D_t(i)$ can also be viewed as an estimate of

$$\hat T_{\mu_{k+1}}(\Phi r_k)(i) = \sum_{\ell=0}^{\infty} f_\ell(i)\, T^{\ell+1}_{\mu_{k+1}}(\Phi r_k)(i). \tag{4.20}$$

Thus iteration (4.18) may also be viewed as a simulation-based implementation of the optimistic PI method

$$\Phi r_{k+1} = \hat\Pi \hat T_{\mu_{k+1}}(\Phi r_k),$$

where $\hat\Pi$ is projection with respect to the weighted Euclidean norm defined by $\hat\zeta_t$. From a practical point of view,
this iteration and the λ-PI iteration $\Phi r_{k+1} = \Pi T^{(\lambda)}_{\mu_{k+1}}(\Phi r_k)$ perform similarly: there is only a difference in
the projection norm ($\hat\Pi$ rather than $\Pi$), and a difference in the weights of the terms $T^{\ell+1}_{\mu_{k+1}}(\Phi r_k)$ [$f_\ell(i)$ rather than
$(1-\lambda)\lambda^\ell$]; compare $\hat T_{\mu_{k+1}}(\Phi r_k)$ as given by Eq. (4.20) with

$$T^{(\lambda)}_{\mu_{k+1}}(\Phi r_k)(i) = \sum_{\ell=0}^{\infty} (1-\lambda)\lambda^\ell\, T^{\ell+1}_{\mu_{k+1}}(\Phi r_k)(i),$$

the definition of $T^{(\lambda)}_{\mu_{k+1}}$. Neither difference should affect significantly the quality of the obtained approximation
$\Phi r_{k+1}$.

In conclusion, with the λ-PI(1) implementation (4.10)-(4.12), as $t \to \infty$, we obtain in the limit the λ-PI
iteration Eq. (4.13), with comparable performance degradation due to simulation noise as for the LSPE(λ)
implementation of Section 4.1. A key characteristic of the implementation is that it deals with the issue of
exploration flexibly and effectively. Since a trajectory of the stopping problem is completed at each transition
with the potentially large probability $1 - \lambda$, a restart with a new initial state $i_0$ is frequent and the length
of each of the simulated trajectories is relatively small. The restart mechanism can be used as a "natural"
form of exploration, by choosing appropriately the restart distribution $\zeta_0$ so that $\zeta(i)$ reflects a "substantial"
weight for all states $i$. Thus λ-PI(1) is like LSPE(λ) (Section 4.1), but with built-in exploration enhancement.
Compared to λ-PI(0) (Section 4.2) it involves reduced bias since it aims to find the limit point of TD(λ),
not TD(0). In particular, as $\lambda \to 1$, it produces an evaluation $\Phi r_{k+1}$ that tends to the fixed point of TD(1),
i.e., the projection $\Pi J_{\mu_{k+1}}$.

4.4 Comparison with Alternative Approximate PI Methods

The preceding λ-PI implementations are in direct competition with approximate PI methods that use
LSTD(λ) for policy evaluation. A popular method, often referred to as LSPI (Lagoudakis and Parr [LaP03]),
can be simply described as approximate PI combined with LSTD(0) for policy evaluation. The LSPI and
λ-PI(0) methods have been compared in [ThS10a] in terms of four characteristics.

(a) Bias: Both methods are subject to qualitatively similar bias [they aim to find the limit point of TD(0)].

(b) Sample efficiency: Both methods can reuse the same set of sample state trajectories over all policies. (In
the model-free case where Q-factors are approximated, again the set of sample state-control trajectories
is reusable.)

(c) Exploration: Both methods provide the same options for exploration, since the validity of these methods
does not depend on whether the simulation trajectories are obtained by using the current policy [in
fact these trajectories are reusable as per (b) above].

(d) Optimistic operation: Since λ-PI(0) has an iterative character ($r_{k+1}$ depends on $r_k$), it is less susceptible
to simulation noise and has an advantage over LSPI in the case where the number of samples per policy
is low. Indeed this assertion is made by Thiery and Scherrer [ThS10a] based on experimentation; they
also found that the effect of the choice of $\lambda$ is more pronounced in this case.

Note that (b) and (c) above are the advantages of LSPI and λ-PI(0) over the LSPE(λ) implementation of
Section 4.1 (which in turn involves less bias because of the use of $\lambda > 0$, and also has an optimistic character).

Let us now compare λ-PI(1) with LSPI and λ-PI(0) in terms of the characteristics (a)-(d) above. It
has better bias characteristics as noted earlier. It has worse sample efficiency as it cannot reuse simulation
trajectories (it can only reuse the restart state sequence). It deals with exploration about as well, thanks
to the restart mechanism of the SSP formulation. Finally, like λ-PI(0), λ-PI(1) has an optimistic character,
and has a similar advantage over LSPI in this regard, cf. (d) above.

4.5 Exploration-Enhanced LSTD(λ) with Geometric Sampling

The geometric sampling idea underlying the λ-PI(1) implementation of Eqs. (4.10)-(4.12) may also be modified
to obtain an exploration-enhanced version of LSTD(λ). In particular, we use the same simulation
procedure, and in analogy to Eq. (4.10) we define

$$c_{\ell,m}(r) = \alpha^{N_m-\ell}\,\phi(i_{N_m,m})'r + \sum_{q=\ell}^{N_m-1} \alpha^{q-\ell}\, g\big(i_{q,m}, \mu_{k+1}(i_{q,m}), i_{q+1,m}\big).$$

We then obtain an approximation to the solution of the projected equation

$$\Phi r = \Pi T^{(\lambda)}_{\mu_{k+1}}(\Phi r)$$

[cf. Eq. (4.13)] by finding $\hat r$ such that

$$\hat r = \arg\min_{r \in \Re^s} \sum_{m=1}^{t} \sum_{\ell=0}^{N_m-1} \big(\phi(i_{\ell,m})'r - c_{\ell,m}(\hat r)\big)^2. \tag{4.21}$$

By writing the optimality condition

$$\sum_{m=1}^{t} \sum_{\ell=0}^{N_m-1} \phi(i_{\ell,m}) \big(\phi(i_{\ell,m})'\hat r - c_{\ell,m}(\hat r)\big) = 0$$

for the least squares minimization in Eq. (4.21) and solving for $\hat r$, we obtain the following implementation
of LSTD(λ):

$$\hat r = \hat C^{-1} \hat d, \tag{4.22}$$

where

$$\hat C = \sum_{m=1}^{t} \sum_{\ell=0}^{N_m-1} \phi(i_{\ell,m}) \big(\phi(i_{\ell,m}) - \alpha^{N_m-\ell}\phi(i_{N_m,m})\big)', \tag{4.23}$$

and

$$\hat d = \sum_{m=1}^{t} \sum_{\ell=0}^{N_m-1} \phi(i_{\ell,m}) \sum_{q=\ell}^{N_m-1} \alpha^{q-\ell}\, g\big(i_{q,m}, \mu_{k+1}(i_{q,m}), i_{q+1,m}\big). \tag{4.24}$$

For a large number of trajectories $t$, the exploration-enhanced LSTD(λ) method (4.21) [or equivalently
(4.22)-(4.24)] and λ-PI(1) [cf. Eq. (4.12)] yield similar results, particularly when $\lambda \approx 1$. However, λ-PI(1)
has an iterative character ($r_{k+1}$ depends on $r_k$), so it is reasonable to expect that it is less susceptible to
simulation noise in an optimistic PI setting where the number of samples per policy is low.
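
The quantities entering Eqs. (4.22)-(4.24) are obtained by differencing each feature against the discounted terminal feature of its trajectory and accumulating the discounted transition costs. The sketch below illustrates this under simplifying assumptions: a scalar feature (so the two accumulated quantities are scalars), trajectories supplied as lists of states $(i_0, \ldots, i_N)$, costs in a table `g[i][j]`, and a hypothetical function name.

```python
from typing import Sequence, Tuple


def lstd_geometric_terms(trajectories: Sequence[Sequence[int]], alpha: float,
                         phi: Sequence[float],
                         g: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Accumulate the scalar-feature analogs of Eqs. (4.23) and (4.24)."""
    c_hat = 0.0
    d_hat = 0.0
    for states in trajectories:
        n_m = len(states) - 1
        phi_end = phi[states[n_m]]  # feature of the last state i_N
        tail_cost = 0.0             # sum_{q >= l} alpha**(q - l) * g(i_q, i_{q+1})
        discount = 1.0              # alpha**(N - l)
        for l in range(n_m - 1, -1, -1):
            i, j = states[l], states[l + 1]
            tail_cost = g[i][j] + alpha * tail_cost
            discount *= alpha
            c_hat += phi[i] * (phi[i] - discount * phi_end)
            d_hat += phi[i] * tail_cost
    return c_hat, d_hat


# Illustrative data: one trajectory with two transitions.
c_demo, d_demo = lstd_geometric_terms([[0, 1, 0]], 0.5, [1.0, 2.0],
                                      [[0.0, 1.0], [3.0, 0.0]])
r_hat_demo = d_demo / c_demo  # scalar analog of Eq. (4.22)
```

For the illustrative trajectory the accumulated values are 3.75 and 8.5. When every trajectory consists of a single transition, the accumulation reduces to the familiar LSTD(0) sums over independent transitions.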


As an example, when $\lambda = 0$, all the simulation trajectories consist of a single transition, so $N_m = 1$ for
all $m = 1, \ldots, t$. Then, using Eqs. (4.23) and (4.24), the equation $\hat C r = \hat d$ becomes

$$\sum_{m=1}^{t} \phi(i_{0,m}) \big(\phi(i_{0,m}) - \alpha\phi(i_{1,m})\big)' r = \sum_{m=1}^{t} \phi(i_{0,m})\, g\big(i_{0,m}, \mu_{k+1}(i_{0,m}), i_{1,m}\big).$$

It yields the same vector $\hat r = \hat C^{-1} \hat d$ as the LSTD(0) method that simulates $t$ independent transitions
according to the restart distribution $\zeta_0$, rather than simulating a single long trajectory. In fact this is the
policy evaluation process in the LSPI method mentioned in Section 4.4. The geometric sampling procedure
described here allows exploration enhancement for any $\lambda$.

5. CONCLUSIONS

We discussed a few implementations of λ-PI with linear cost function approximation, which have different
strengths and weaknesses with respect to dealing with the critical issues of bias and exploration. Out of the
three implementations, the one of Section 4.3, λ-PI(1), is new and seems capable of dealing well with both
issues, although it has worse sample complexity than the λ-PI(0) implementation of Section 4.2.

On the other hand, our discussion has been somewhat speculative, and our assessments, while relying on
past computational experience, still require supportive experimentation. Moreover, the λ-PI implementations
should be compared to other approximate PI methods based on projected equations, such as the
exploration-enhanced LSTD(λ) method for policy evaluation, discussed in Section 3, and the LSPI method discussed
in Section 4.4. A computational comparison of λ-PI(0) with this latter method is given in [ThS10a], and a
similar comparison with λ-PI(1) would be desirable.

Fundamentally, λ-PI(1) is based on geometric sampling, a new simulation idea for λ-methods that uses
multiple short trajectories with exploration-enhanced restart, rather than a single infinitely long trajectory.
This idea can also be applied to LSTD(λ), thereby obtaining a new exploration-enhanced version of this
method, which has been described in Section 4.5.